Two different formats for multiple-choice test items were compared in an
experimental test given in a college class in introductory psychology. In
one format, a question or incomplete statement was followed by four answers
or completions, only one of which was correct. In the other format, the
double multiple-choice version, the same questions were used together with
many of the same answers, but any number of these answers might be correct
and a second set of choices required the test taker to choose a response
that showed which answer or combination of answers was right. The results of
this experiment showed the double multiple-choice format to be more
difficult and less discriminating than the standard format. The two 20-item
subtests made up of double multiple-choice items were significantly less
reliable than the parallel 20-item subtests made up of standard format
items; however, there were no significant differences in the validities of
the two formats in predicting final course grades.
The author is indebted to Nate Clark for making subjects available and to
Bruce Korth for his advice regarding this study. Portions of this article
were presented at the 1979 meeting of the American Psychological Association.
Double Multiple-Choice Questions
ABSTRACT

Double multiple-choice questions include two sets of options; from the
latter set the subject must select the option which identifies the correct
or best collection of options in the first set. A comparison of double as
opposed to single multiple-choice questions was made on a classroom psychology
exam taken by 47 undergraduates. Forty pairs of matched single and double
format questions were prepared and 20 items of each type appeared on each of
two forms of the exam. Significant differences were observed between the two
question types in regard to item difficulty, item discrimination, and internal
reliability but not concurrent validity. The results were interpreted as
suggestive that double multiple-choice questions may increase variance
attributable to certain subject characteristics other than content mastery.



RELATIVE EFFECTIVENESS OF SINGLE AND DOUBLE
MULTIPLE-CHOICE QUESTIONS IN EDUCATIONAL MEASUREMENT

The present study examined the effects of employing "double multiple-choice"
questions as opposed to conventional "single multiple-choice" questions in
educational measurement. The term "double multiple-choice" is used to
describe questions of the form seen in the example below, which was drawn
from the 1977-78 Law School Admission Bulletin and LSAT Study Guide.

14. Which of the following can be inferred from the
    graphs?

      I. The Jewish population of North America is
         larger than the Jewish population of any
         other continent.

     II. Of the six continents, Asia has the greatest
         number of people who profess no religion.

    III. South America has a larger population of
         Roman Catholics than any other continent.

    A. None
    B. I only
    C. III only
    D. I and III only
    E. I, II, and III

As can be seen above, double multiple-choice questions include two sets of
options. From the latter set, the subject must select the option which
identifies the correct or best collection of options listed in the first
set. In contrast, a single multiple-choice question involves only one set
of options.

It is apparent that double multiple-choice questions share some
continuity with the rather common practice of using options such as "all of
the above" in conventional multiple-choice questions with only one set of
options. However, the practice of including two complete sets of options
appears to represent a substantial departure from conventional item writing
procedures which merits empirical examination.

Although the author has no systematic evidence, the popularity of
double multiple-choice questions appears to be increasing and questions
of this type now appear on important admissions tests (e.g. [illegible]) and
licensing examinations (e.g. Illinois State Board Test Pool Exam for
Registered Nurses). An extensive search of the educational measurement
literature uncovered no evidence on the relative merits of double as
opposed to single multiple-choice items. This dearth of empirical evidence
is not surprising. Although there is a substantial literature on rules for
writing multiple-choice questions (cf. Gronlund, 1965; Wood, 1960), this
literature has virtually no empirical data base. Thorndike and Hagen (1955)
provided an erudite description of the situation almost 25 years ago which
unfortunately is still accurate today.

    The point we wish to make is that we do not have
    a science of test construction. The guides and
    maxims we will offer (for item construction) are
    not tested out by controlled scientific
    experimentation. Rather, they represent a distillation of
    practical experience and professional judgment
    (p. 50).

The reasons for this paucity of research on item construction are unclear.
The problem of writing sound test items obviously has considerable practical
relevance to important endeavors in educational measurement, and does not
seem to be impeded by any particularly formidable methodological roadblocks.

While there apparently is no empirical evidence, there are a number of
admittedly speculative reasons for suspecting double and single
multiple-choice items might be differentially effective. First, certain
characteristics of the double format questions are inconsistent with some of
the conventional item writing guidelines (Ebel, 1965; Gronlund, 1965). For
instance, double format questions tend to conflict with conventional wisdom
which puts a premium on item brevity and cautions against writing
multiple-choice questions in which the options function as a series of
true-false propositions. Second, one could argue that the lengthier and more
complex double format questions might be more sensitive to individual
differences in reading ability or reasoning skill. Since an educational test
is usually designed to measure some content knowledge, additional score
variation attributable to differences in reading or reasoning skill would
represent an increment in error variance. Third, there is evidence that
subjects' exam performance may be influenced by their test-wiseness (Millman,
Bishop, & Ebel, 1965; Rowley, 1974). It seems reasonable to conjecture that
the more elaborate double format items may be more sensitive to these
individual differences in test-taking skill.


Cognizant of the possibilities outlined above, the present study compared
single and double multiple-choice questions in regard to difficulty, item
discrimination, reliability, and validity. Based on anecdotal evidence gleaned
from students' common complaints that double format items are excessively
difficult, it was hypothesized that item difficulty indices would be lower for
the double format questions than for the single format questions. Based on
reasoning that double format items may introduce additional sources of error
variance, it was hypothesized that the double multiple-choice questions would
yield lower item discrimination indices than the conventional questions and
that subtests consisting of these double format items would display lower
reliability and validity than the matching single format subtests.
METHOD

Subjects and Procedure

Subjects were undergraduates enrolled in a large introductory psychology
course who were required to participate in an experiment of their choice.
Forty-seven students (28 female, 19 male) elected to participate in the present
study. One week after the first of two exams in the course, the subjects were
assembled outside of class and were administered one of two experimental
exams covering the same reading assignment as the first exam in the course.
Thus, subjects were tested on actual course material that they had recently
studied. Subjects were unaware of the nature of the study until they arrived
for the experimental session. At that time, the nature of the experimental
test was explained, and they were informed that the results would not have
any effect upon their course grade. The two forms of the experimental test
were passed out alternately so that 24 subjects responded to Form A and
23 subjects responded to Form B.
Test Construction

The items for the two forms of the experimental test were drawn from
the instructor's manual designed to accompany the required course textbook,
Psychology Today: An Introduction (1975). The selection of items from the
pool of multiple-choice questions available for the assigned chapters was
not random. All items previously used by the course instructor (not the
author) on the actual mid-term exam had to be eliminated. Furthermore, since
all selected items were to be rewritten into a double multiple-choice format,
the author had to exercise some subjective judgment in choosing items which
could be sensibly transformed into the double format. For each of the
conventional questions drawn from the test item pool, a similar double
multiple-choice item was constructed. Matched items in the two formats had
identical stems, and in most cases the single set of options in the
conventional question was exactly the same as the first set of options in the
double multiple-choice version. Exact identity for all matched questions was
impossible to achieve because the conventional questions each had only one
correct option, whereas at least some of the double format questions had to
have more than one correct option in the first set so that the correct answer
in the second set could involve a collection of options.



Two examples of matched items can be seen below. In the first example,
there is exact identity between the only set of options in the single format
version and the first set of options in the double format version. In the
second example, there is a slight difference between the first set of options
in the double format and the only set of options in the single format.
Example One

Single multiple-choice version

Which of the following approaches postulates that children are taught sex roles
through conditioning and modeling?

 a. Psychoanalytic
*b. Social-learning
 c. Cognitive
 d. Humanistic

Double multiple-choice version

Which of the following approaches postulates that children are taught sex roles
through conditioning and modeling?

 1. Psychoanalytic
 2. Social-learning
 3. Cognitive
 4. Humanistic

 a. 1 only
*b. 2 only
 c. 2 and 3 only
 d. 2, 3, and 4 only
Example Two

Single multiple-choice version

The study of psychology is important to everyone because it provides:

 a. a new perspective from which to view daily events of life
 b. insight into one's own behavior
 c. practical information
*d. all of the above

Double multiple-choice version

The study of psychology is important to everyone because it provides:

 1. a new perspective from which to view daily events of life
 2. insight into one's own behavior
 3. practical information
 4. solutions to everyone's problems

 a. 1 only
 b. 1 and 3 only
*c. 1, 2, and 3 only
 d. 1, 2, 3, and 4

Care was exercised in the construction of questions with discrepancies
such as that illustrated in the second example so as to maintain the essential
character of the original question. Thus, the only significant disparity
between matched items was the addition of the second set of options in the
double multiple-choice format.

Two sets of 40 matched items were developed in this manner. Twenty items
were then randomly assigned to appear in single format on Form A and in
double format on Form B. The remaining twenty items were presented in single
format on Form B and in double format on Form A. The order of items was
identical for the two forms of the experimental test and corresponded to the
order of items in the instructor's manual (as well as the order of coverage
in the text).
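
The counterbalanced assignment described above can be sketched in a few lines
of Python. This is only an illustration: the function name, the item
numbering 1 to 40, and the fixed random seed are assumptions introduced here
and are not taken from the study.

```python
import random


def assign_formats(item_ids, seed=0):
    """Randomly split matched items in half and build two counterbalanced forms.

    Items drawn into the first half appear in single format on Form A and in
    double format on Form B; the remaining items are reversed.
    """
    rng = random.Random(seed)  # the seed is an assumption, for repeatability only
    single_on_a = set(rng.sample(item_ids, len(item_ids) // 2))
    form_a, form_b = [], []
    for item in item_ids:  # keep the same item order on both forms
        if item in single_on_a:
            form_a.append((item, "single"))
            form_b.append((item, "double"))
        else:
            form_a.append((item, "double"))
            form_b.append((item, "single"))
    return form_a, form_b


# Hypothetical item numbers 1..40 standing in for the 40 matched pairs.
form_a, form_b = assign_formats(list(range(1, 41)))
```

Each form then carries 20 single and 20 double items, every item appears in
opposite formats on the two forms, and the item order is the same on both.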

Dependent Variables

Four dependent variables were examined. Item difficulty indices represented
the proportion of subjects correctly answering an item, so that lower figures
were indicative of greater difficulty. Since both types of questions were
ostensibly legitimate measures of content mastery, item-whole point-biserial
correlations were computed to provide item discrimination indices. To assess
internal reliability, each 40 item test was divided into a pair of 20 item
subtests composed exclusively of one type of item format. This necessitated
two comparisons, one between the single items on Form A and the matched double
items on Form B, and one between the double items on Form A and the
corresponding single items on Form B. Concurrent validity was measured
by correlating each subject's single and double subtest scores with that
subject's cumulative point total for the entire course.
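
As a rough illustration of the first two indices, the sketch below computes an
item difficulty index and an item-whole point-biserial correlation using the
standard textbook formula. The function names and the five-subject data are
hypothetical and serve only to show the arithmetic; they are not the study's
data.

```python
from math import sqrt


def item_difficulty(item_scores):
    """Proportion of subjects answering the item correctly (1 = correct, 0 = wrong)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Item-whole point-biserial correlation.

    r_pb = (M_correct - M_all) / SD_all * sqrt(p / (1 - p)),
    where SD_all is the population standard deviation of the total scores.
    """
    n = len(item_scores)
    p = sum(item_scores) / n
    mean_all = sum(total_scores) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if p in (0.0, 1.0) or sd_all == 0.0:
        return 0.0  # correlation is undefined; returning 0 is an assumption
    n_correct = sum(item_scores)
    mean_correct = sum(t for s, t in zip(item_scores, total_scores) if s == 1) / n_correct
    return (mean_correct - mean_all) / sd_all * sqrt(p / (1 - p))


# Hypothetical responses of five subjects to one item, with their test totals.
example_item = [1, 1, 0, 1, 0]
example_totals = [30, 28, 15, 25, 20]
difficulty = item_difficulty(example_item)
discrimination = point_biserial(example_item, example_totals)
```

Because the correlation is item-whole, the total score includes the item
itself, as in the description above.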

RESULTS AND DISCUSSION

Item Difficulty

Mean item difficulty for the 40 single multiple-choice questions was
.633 while the 40 double multiple-choice questions yielded a mean of .575.
A directional t test for differences between correlated means revealed that
this difference was significant (t = 1.74, df = 39, p<.05). The observed
difference is consistent with the common complaint of examinees that the
double format items are more challenging than traditional single format items.
This finding also makes sense in view of the seemingly more complex nature
of double multiple-choice questions.
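
For readers who want the arithmetic behind a t test for correlated means, a
minimal sketch follows. It assumes one difficulty index per matched item in
each format; the function name and the four pairs of values are made up for
illustration and do not reproduce the reported t of 1.74.

```python
from math import sqrt


def paired_t(x, y):
    """t statistic and degrees of freedom for the difference between correlated means."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean_d / sqrt(var_d / n)
    return t, n - 1


# Hypothetical difficulty indices for four matched item pairs.
single_difficulty = [0.70, 0.55, 0.80, 0.60]
double_difficulty = [0.60, 0.50, 0.70, 0.65]
t_value, df = paired_t(single_difficulty, double_difficulty)
```

With 40 matched pairs the same function would return df = 39, matching the
degrees of freedom reported above.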

However, it should be pointed out that disparity in item difficulty is
not crucial to the issue of differential effectiveness in educational
measurement. Although double format questions may tend to be more difficult
than comparable single format questions, it should be stressed that item
difficulty can be manipulated (through the judicious selection and careful
writing of options) to a desired level or optimal range within either format.
Nonetheless, the data suggest that students' dislike of double format
questions is not simply a matter of resisting the unfamiliar. These questions
do appear to confront the examinee with a more difficult task.

Item Discrimination

The point-biserial correlations used to estimate item discrimination do
not represent interval data and they usually are characterized by a decidedly
skewed distribution. Therefore, a non-parametric test (Wilcoxon's T) was
used to compare the distributions of item discrimination indices. Median
item discrimination for the single and double items was .34 and .23
respectively. The single-format items discriminated significantly better
(T = [illegible], p<.05) than did the double-format items. Insofar as the
experimental test [illegible], the disparity in item discrimination suggests
that the single-format questions did a better job than the double-format
questions in distinguishing between well informed and poorly informed
students.
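
The T statistic for Wilcoxon's matched-pairs signed-ranks test can be
sketched as below. The sketch assumes the discrimination indices are paired
by matched item, drops zero differences, and gives tied absolute differences
their average rank. The function name and the five pairs of values are
hypothetical; they are written in hundredths (34 for .34) so that ties are
compared exactly.

```python
def wilcoxon_t(x, y):
    """Wilcoxon signed-rank T: the smaller of the positive and negative rank sums."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(positive, negative), len(ordered)


# Hypothetical discrimination indices (in hundredths) for five matched pairs.
single_disc = [34, 40, 21, 30, 28]
double_disc = [23, 42, 15, 30, 20]
t_stat, n_pairs = wilcoxon_t(single_disc, double_disc)
```

A small T relative to the number of non-zero pairs indicates that the
differences fall predominantly in one direction.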
Reliability

The significance of the two comparisons of the 20 item subtests in
regard to reliability was assessed with Feldt's (1969) W, which is
approximately distributed as F. In the first comparison, KR-20 reliability
for the A-single subtest was .71 as compared to .38 for the B-double subtest
(W = 2.11, df = 22, 23, p<.05). In the second comparison, KR-20 reliability
was [illegible] for the B-single subtest as compared to .45 for the A-double
subtest (W = 1.52, df = 23, 22, p>.05). Thus in both cases, internal
reliability was lower for the double format subtest, although only one of the
differences was significant.
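
The W statistic can be illustrated with the rounded coefficients reported for
the first comparison. The sketch assumes the usual form of Feldt's statistic
for two independent samples, the ratio of the two values of one minus the
reliability coefficient; the function name is hypothetical, and the formula is
an assumption about the computation rather than something stated in the text.

```python
def feldt_w(higher_reliability, lower_reliability):
    """Ratio (1 - lower) / (1 - higher) for two independent reliability coefficients.

    The result is referred to an F distribution whose degrees of freedom are
    based on the two sample sizes minus one.
    """
    return (1 - lower_reliability) / (1 - higher_reliability)


# Rounded coefficients from the first comparison reported above.
w_first = feldt_w(0.71, 0.38)
```

With the rounded inputs the ratio is about 2.14, close to the reported 2.11;
the small gap is presumably due to rounding of the published coefficients.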

These estimates of internal reliability are largely a function of test
length and the homogeneity of test content. Since the matched subtests were
of the same length and involved the same sampling of the content domain, the
lower internal reliability observed on the double format subtests suggests that
the double multiple-choice questions may have increased the variance
attributable to factors which are independent of subjects' content mastery.
The low internal reliability displayed by the double format subtests, coupled
with the observed difference in item discrimination, seems consistent with the
conjectural analysis that the double format questions may produce greater
variance due to subject differences in reading and reasoning skills or
test-wiseness.

Validity

No significant difference was observed between the two formats in regard
to concurrent validity. [Illegible] the double-format subtests correlated .54
with students' actual course totals. This lack of a difference in concurrent
validity seems inconsistent with the differences observed in regard to item
discrimination and reliability. However, this apparent inconsistency may have
a reasonable explanation. The double-format questions may introduce additional
sources of variance which happen to be highly related to the criterion
variable of course knowledge. For instance, if reading and reasoning skills
are highly correlated with course mastery (an intuitively plausible
assumption), then additional variance attributable to these subject
characteristics would not necessarily lower the validity of the double-format
subtests.

Overall, the pattern of results suggests that there may be some interesting
differences between double and single-format questions which merit further
research. The implicit assumption that the two formats are equivalent was not
supported by the data. In addition to the expected, and relatively innocuous,
discrepancy in difficulty level, more disturbing differences in item
discrimination and test reliability were found. These differences suggest that
double multiple-choice questions may generate additional variance in test
scores which is not attributable to differences in content mastery. Insofar as
this additional variance may largely reflect differences among subjects in
important cognitive skills such as reading and reasoning, it would not be
particularly problematical on aptitude tests such as the LSAT. In contrast,
on classroom tests or licensing exams, which are intended to measure mastery of
specific information, these sources of variance would represent an
increment in error variance. However, in view of the failure to observe a
difference in concurrent validity on the classroom test used in the present
study, any alarming assertions about the differential effectiveness of the
two formats [remainder illegible]